As NBA fans, we are witnessing what is probably the greatest revolution the court has ever seen: the “small ball era”. Teams tend to use smaller, faster players instead of traditional big men to speed up the game and improve shooting efficiency.
Last year, our favorite team, the Knicks, returned to the playoffs after eight years, which brought great joy to the fans. To help make this performance last, we set out to research the key variables that contribute to winning, so that the Knicks can maintain their existing strengths and make up for their weaknesses. The most obvious features of the “small ball era” are a faster pace and a rise in three-point attempts. We therefore focus on offensive play types and three-point-related variables, along with other factors, in our analysis.
To secure a playoff spot, the Knicks need to win enough games to finish in the top eight of the Eastern Conference. So we start by exploring the distribution of the number of games won by NBA playoff teams, to find the threshold number of wins needed to enter the playoffs. Then we want to identify the contributors to total wins in a season and to a win in a single game. Assuming that total wins are independent across teams and years, and that the result of a single game is independent of any other, we use a linear regression model to fit total wins with average-performance predictors from both the offensive and the defensive side. In addition, we use our model to predict the Knicks’ performance in the new 2021-22 season to see whether they can make the playoffs. Then, based on the models we build, we analyze the Knicks’ performance on the key predictors to see the gap between the Knicks and the top teams. We also drill down into these gaps from the team level to the player level to find how the leading players should improve to get more wins. Finally, we propose a detailed game and training strategy.
What is the threshold number of wins to enter the playoffs?
Which variables contribute to a team’s total wins in a season, and how do they affect the result of a single game?
If we use the model we built to predict the Knicks’ ranking with data from the new season, will they make the playoffs?
What is the difference in performance between the Knicks and the league average?
1). How does the Knicks’ three-point shooting performance differ from the league average?
2). Are there any obvious weaknesses in their three-point shooting?
As our project needs detailed stats about NBA teams and players from the last 10 NBA regular seasons, we used web scraping to get official advanced data from NBA Stats. We mainly used four datasets:
Advanced Box Score: in this data set, each observation represents a game and the stats recorded in that game, such as the score, total field goal attempts, three-pointers made and so on.
Playtype by Team: this data set contains season averages for each team by offensive play type, such as isolation, pick and roll, etc. Each observation represents a team’s average in a regular season for a specific play type.
Tracking: this data set contains detailed information about NBA teams’ average movement data in a regular season, for example passes and touches.
Knicks Shooting Log: in this data set, each observation represents a field goal made by a Knicks player, including the player who made the shot, the location of the shot, and the time remaining when the shot was made.
As mentioned above, these data sets were scraped from the NBA website; the code used to scrape the data can be found at scrapping data.
First, the datasets in NBA Stats don’t have an API, so we wrote a function to extract them using devtools.
Then we apply the function to each dataset we want, select the variables we need, and save them all locally for further use. Here is one example, box_score_all.
library(tidyverse)
library(httr)      # GET(), add_headers(), content()
library(jsonlite)  # fromJSON()

# Request one NBA Stats endpoint and return its first result set as a data frame
scrapping_data = function(url) {
  # browser-like request headers sent with every call
  headers = c(
    `Connection` = 'keep-alive',
    `Accept` = 'application/json, text/plain, */*',
    `x-nba-stats-token` = 'true',
    `X-NewRelic-ID` = 'VQECWF5UChAHUlNTBwgBVw==',
    `User-Agent` = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36',
    `x-nba-stats-origin` = 'stats',
    `Sec-Fetch-Site` = 'same-origin',
    `Sec-Fetch-Mode` = 'cors',
    `Referer` = 'https://stats.nba.com/players/leaguedashplayerbiostats/',
    `Accept-Encoding` = 'gzip, deflate, br',
    `Accept-Language` = 'en-US,en;q=0.9')
  response = GET(url, add_headers(.headers = headers))
  data = fromJSON(content(response, as = "text", encoding = "UTF-8"))
  # rowSet[[1]] holds the rows and headers[[1]] the matching column names.
  # The argument is spelled stringsAsFactors; the earlier misspelling
  # (stringAsFactors) was treated as an extra column, which is why the last
  # column previously had to be dropped after scraping.
  df = data.frame(data$resultSets$rowSet[[1]], stringsAsFactors = FALSE)
  names(df) = tolower(data$resultSets$headers[[1]])
  return(df)
}

season_years = c("2020-21", "2019-20",
                 "2018-19", "2017-18",
                 "2016-17", "2015-16",
                 "2014-15", "2013-14",
                 "2012-13", "2011-12",
                 "2010-11", "2009-10",
                 "2008-09", "2007-08",
                 "2006-07", "2005-06",
                 "2004-05", "2003-04",
                 "2002-03", "2001-02")

# one request per season, then stack all seasons into a single table
box_score_all = tibble(
  season_year = season_years,
  url = str_c("https://stats.nba.com/stats/teamgamelogs?DateFrom=&DateTo=&GameSegment=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Totals&Period=0&PlusMinus=N&Rank=N&Season=", season_year, "&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&VsConference=&VsDivision="),
  box_score = map(url, scrapping_data)) %>%
  select(-season_year, -url) %>%  # season_year is also a column of each box score
  unnest(cols = box_score)

write_csv(box_score_all, "./data2/box_score_all.csv")
Next, we read them back and select the relevant variables. The raw datasets are as follows:
box_score_all = read_csv("./data2/box_score_all.csv") %>%
janitor::clean_names() %>%
select(-contains("rank"))
pass_df =
read_csv("./data2/pass_df.csv") %>%
select(season_year, team_abbreviation, passes_made)
isol_df =
read_csv("./data2/isol_df.csv") %>%
select(season_year, team_abbreviation, poss) %>%
rename(poss_iso = poss)
prbh_df =
read_csv("./data2/prbh_df.csv") %>%
select(season_year, team_abbreviation, poss) %>%
rename(poss_prb = poss)
prrm_df =
read_csv("./data2/prrm_df.csv") %>%
select(season_year, team_abbreviation, poss) %>%
rename(poss_prr = poss)
defend_df =
read_csv("./data2/defensive_impact_df.csv") %>%
select(season_year, team_abbreviation, stl, blk, dreb)
trans_df =
read_csv("./data2/transition_df.csv") %>%
select(season_year, team_abbreviation, poss) %>%
rename(poss_trans = poss)
After reading these semi-raw datasets, we need to prepare four dataframes for exploratory data analysis and for fitting the regression models.
The avg_df contains the number of wins by team and year over the last 20 years; it is used to analyse the distribution of threshold wins and to generate predict_df for regression. As the data in box_score_all is recorded by game, we need to summarise it into total and average stats. The detailed wrangling process is:
1). Select the variables we are interested in from box_score_all.
2). Define four new variables:
* win: 1 means a win and 0 means a loss.
* game_num: a constant 1, used to sum the number of games.
* fg3a_p: the proportion of field goal attempts that are three-point attempts.
* conference: west or east.
3). Summarise and rank:
* calculate the total number of wins, games, field goal attempts, field goals made, three-point attempts and three-pointers made by team and season.
* calculate the average number of points, assists and turnovers per game by team and season.
* rescale the number of wins to an 82-game schedule for the shortened seasons (the 2011-12 lockout season and the COVID-19 seasons).
* move the key variables to the front.
* arrange all data by season year and number of wins.
* calculate the rank within the east or west conference by team and year.
* mark whether or not a team made the playoffs in that year.
avg_df =
box_score_all %>%
select(season_year, team_abbreviation, wl, pts, ast, tov, fgm, fga, fg3m, fg3a) %>%
mutate(
win = case_when(wl == "W" ~ 1, TRUE~0),
game_num = 1,
fg3a_p = round(fg3a/fga, digits = 3),
team_abbreviation = str_replace(team_abbreviation, "NOH", "NOP"),
team_abbreviation = str_replace(team_abbreviation, "NJN", "BKN"),
# caveat: the Hornets played in the Eastern Conference through 2003-04
# (CHH in 2001-02, NOH in 2002-03 and 2003-04), but this mapping assigns them to the west
conference = case_when(
team_abbreviation %in% c("UTA","PHX","LAC","DEN","DAL","LAL","POR","GSW","SAS","MEM","NOP","SAC","MIN","OKC","HOU","SEA","NOK","CHH")~"west",
team_abbreviation %in% c("PHI","BKN","MIL","ATL","NYK","MIA","BOS","IND","WAS","CHI","TOR","CLE","ORL","DET","CHA")~"east") # divide into east and west conference
) %>%
group_by(season_year, team_abbreviation, conference) %>%
summarise(
wins = sum(win),
games = sum(game_num),
games_should = 82,
pts_avg = round(mean(pts), digits = 1),
ast_avg = round(mean(ast), digits = 1),
tov_avg = round(mean(tov), digits = 1),
fgm_total = sum(fgm),
fga_total = sum(fga),
fg3m_total = sum(fg3m),
fg3a_total = sum(fg3a)
) %>%
mutate(wins_revised = round(wins/games*games_should,0)) %>% # due to labor negotiation in 2011-12, COVID-19.
relocate(season_year, team_abbreviation, conference, wins, wins_revised, everything()) %>%
arrange(desc(season_year),desc(wins)) %>%
mutate(fg3_p = fg3a_total/fga_total, fg3_r = fg3m_total/fg3a_total) %>%
group_by(season_year,conference) %>%
mutate(
conf_rank = row_number(),
play_off_team = case_when(
conf_rank <= 8 ~ "playoff",
conf_rank > 8 ~ "non-playoff"
),
play_off_team = fct_relevel(play_off_team, c("playoff", "non-playoff")))
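As a quick sanity check of the derived variables, the sketch below works through the rescaling behind wins_revised and the two three-point ratios with plain arithmetic. The 41 wins in 72 games is the Knicks’ 2020-21 regular-season record; the single-game shooting line is made up purely for illustration and does not come from the scraped data.

```r
# wins_revised: rescale a shortened season to an 82-game schedule
wins  <- 41   # Knicks, 2020-21 regular season
games <- 72   # the 2020-21 season had 72 games per team
wins_revised <- round(wins / games * 82, 0)
wins_revised

# Hypothetical single-game shooting line (illustrative numbers only)
fga  <- 90   # field goal attempts
fg3a <- 36   # three-point attempts
fg3m <- 13   # three-pointers made
fg3a_p <- round(fg3a / fga, digits = 3)   # share of attempts that are threes
fg3_r  <- round(fg3m / fg3a, digits = 3)  # three-point shooting rate
c(fg3a_p = fg3a_p, fg3_r = fg3_r)
```

For a full 82-game season the rescaling leaves the win total unchanged, so it only affects the shortened seasons.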
The predict_df contains average performance data together with the total number of games won over the last 8 years; it is used to build models and predict the number of wins.
Besides the fundamental average stats from avg_df, such as points, assists and turnovers, we also want to include play type data and defensive data (steals, blocks and defensive rebounds) in the model. Thus, we combine avg_df with 6 other dataframes.
predict_df =
avg_df %>%
left_join(defend_df, by = c("season_year","team_abbreviation")) %>%
left_join(prrm_df, by = c("season_year","team_abbreviation")) %>%
left_join(prbh_df, by = c("season_year","team_abbreviation")) %>%
left_join(isol_df, by = c("season_year","team_abbreviation")) %>%
left_join(pass_df, by = c("season_year","team_abbreviation")) %>%
left_join(trans_df, by = c("season_year","team_abbreviation")) %>%
drop_na(poss_trans, passes_made, poss_iso, poss_prb, poss_prr, stl, blk, dreb) %>%
mutate(
poss_pr = poss_prr + poss_prb
) %>%
select(-poss_prr, -poss_prb, -wins, -games, -games_should, -fgm_total, -fga_total)
The box_score_viz dataframe contains 23476 game-level observations from the last 10 seasons (2011-12 to 2020-21); it is mainly used for analyzing trends in different stats between playoff teams and non-playoff teams.
To add each team’s rank, we join it with conf_rank from avg_df.
box_score_viz =
box_score_all %>%
filter(season_year %in% c("2011-12", "2012-13", "2013-14", "2014-15", "2015-16", "2016-17", "2017-18", "2018-19", "2019-20", "2020-21")) %>%
mutate(team_abbreviation = str_replace(team_abbreviation, "NOH", "NOP"),
team_abbreviation = str_replace(team_abbreviation, "NJN", "BKN")) %>%
select(season_year, team_abbreviation, wl, pts, ast, tov, fgm, fga, fg3m, fg3a, stl, blk, dreb) %>%
mutate(
win = case_when(wl == "W" ~ 1, TRUE~0),
game_num = 1,
conference = case_when(
team_abbreviation %in% c("UTA","PHX","LAC","DEN","DAL","LAL","POR","GSW","SAS","MEM","NOP","SAC","MIN","OKC","HOU","SEA","NOK","CHH")~"west",
team_abbreviation %in% c("PHI","BKN","MIL","ATL","NYK","MIA","BOS","IND","WAS","CHI","TOR","CLE","ORL","DET","CHA")~"east"), # divide into east and west conference
fg3a_p = round(fg3a/fga, digits = 3),
fg3_r = round(fg3m/fg3a, digits = 3)
) %>%
relocate(season_year, team_abbreviation, conference)
conf_rank =
avg_df %>%
filter(season_year %in% c("2011-12", "2012-13", "2013-14", "2014-15", "2015-16", "2016-17", "2017-18", "2018-19", "2019-20", "2020-21")) %>%
ungroup() %>%
select(season_year, team_abbreviation, conference, conf_rank)
#join the two table together
box_score_viz =
box_score_viz %>%
left_join(conf_rank, by = c("season_year", "team_abbreviation", "conference")) %>%
mutate(play_off_team = case_when(
conf_rank <= 8 ~ "playoff",
conf_rank > 8 ~ "non-playoff"
),
play_off_team = fct_relevel(play_off_team, c("playoff", "non-playoff")),
fg3p = fg3m / fg3a) %>%
relocate(season_year, team_abbreviation, conference, play_off_team)
The regre_df dataframe contains per-game stats and is used for logistic regression. We drop the variables we do not need and convert the win/loss indicator to a factor.
regre_df =
box_score_all %>%
select(-c(1:7)) %>%
select(-ends_with("rank")) %>%
mutate(wl = recode(wl, "W" = 1, "L" = 0),
wl = as.factor(wl))
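For orientation, a per-game logistic regression of the kind regre_df is prepared for could be sketched as follows. This is a minimal sketch on simulated data, not the model fitted in this report: the toy data frame, the choice of fg_pct and tov as predictors, and the coefficients used to simulate outcomes are all assumptions made for illustration.

```r
set.seed(2021)
n <- 500

# Simulated stand-in for regre_df (NOT real NBA data)
toy <- data.frame(
  fg_pct = rnorm(n, mean = 0.46, sd = 0.05),  # field goal percentage
  tov    = rpois(n, lambda = 14)              # turnovers
)

# Assumed data-generating process, for illustration only
log_odds <- -12 + 30 * toy$fg_pct - 0.12 * toy$tov
toy$wl <- factor(rbinom(n, size = 1, prob = plogis(log_odds)), levels = c(0, 1))

# Logistic regression: models the log-odds of a win (wl == 1)
fit <- glm(wl ~ fg_pct + tov, data = toy, family = binomial)
summary(fit)$coefficients

# Exponentiated coefficients are odds ratios per one-unit change
exp(coef(fit))
```

Because wl is a factor with levels 0 and 1, glm() treats the second level (1, a win) as the event being modelled.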
We are going to fit two regression models with the data above. One fits the number of wins in a season with average performance. The other fits the win or loss of a single game with the stats of that game.
1. Predict the number of wins
The dependent variable is the number of wins by team and season, denoted by wins_revised.
The independent variables are selected from both the offensive and the defensive side.
The typical attributes of the “small ball era” are more three-point shooting and a quicker pace. So we select the following variables to represent the offensive level of a team:
* fg3_p: proportion of field goal attempts that are three-point attempts
* fg3_r: three-point shooting rate (three-pointers made divided by three-point attempts)
* pts_avg: average points per game
* tov_avg: average number of turnovers per game
* ast_avg: average number of assists per game
* poss_trans: average number of transitions
* passes_made: average number of passes per game
* poss_iso: average number of isolations per game
* poss_pr: average number of pick and rolls
As for the defensive level, the variables joined from defend_df are:
* stl: steals
* blk: blocks
* dreb: defensive rebounds
2. Predict the win or loss of a game
These are some reasonable variables that should be added to the regression model:
In this part, we explore which variables differ between teams that made the playoffs and teams that did not. This gives us some insight for choosing potential parameters for model building. Specifically, we identify the trend in three-point attempts over the past 10 seasons.
Recall that the “threshold” is the number of games won by the 8th-ranked team in each conference (west and east).
Plot the threshold over the last 20 years.
eighth_wins =
avg_df %>%
filter(conf_rank == 8)
ggplot(data = eighth_wins,aes(x = wins_revised)) +
geom_bar() +
theme_bw()

summary(eighth_wins$wins_revised)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 35.00 39.00 42.00 42.35 45.00 50.00
There are 40 observations of the threshold over the past 20 years, which we treat as approximately normally distributed with mean 42.35 and variance 15.46.
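One way to back up a normality assumption like this is a Shapiro-Wilk test, with qqnorm() and qqline() as a visual complement. The sketch below shows the mechanics only: the threshold vector is simulated as a stand-in, so its numbers are not the report’s results, and running the same calls on eighth_wins$wins_revised is left as an assumption about how the check would be applied.

```r
set.seed(1)

# Hypothetical stand-in for eighth_wins$wins_revised (NOT the real data)
threshold <- round(rnorm(40, mean = 42, sd = 4))

# Sample mean and variance, the two parameters of the fitted normal
c(mean = mean(threshold), variance = var(threshold))

# Shapiro-Wilk test: a small p-value is evidence against normality
sw <- shapiro.test(threshold)
sw$p.value
```

With only 40 observations the test has limited power, so a large p-value is compatible with normality rather than proof of it.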
Next, we take a deeper look at the factors associated with a team’s number of wins per season and try to find significant contributors to increasing that number.
First, we looked at how points per game were distributed over the last 10 seasons for teams that made the playoffs and teams that did not, as shown in the figure below.
non_play_off =
box_score_viz %>%
filter(play_off_team == "non-playoff")
box_score_viz %>%
filter(play_off_team == "playoff") %>%
ggplot(aes(x = pts, y = season_year)) +
geom_density_ridges(scale = .8, alpha = .5, fill = "blue",
quantile_lines = T, quantile_fun = mean) +
geom_density_ridges(data = non_play_off, aes(x = pts, y = season_year),
scale = .8, alpha = .5, fill = "salmon",
quantile_lines = T, quantile_fun = mean) +
scale_fill_manual(name = "Team", values = cols) +
xlim(65, 140) +
labs(x = "Scores",
y = "Season Year",
title = "Score Distribution Between Playoff and Non-playoff Teams")

Two things are obvious.
First, over the last 10 regular seasons, points per game show an increasing trend. Second, teams that made the playoffs have higher average scores than teams that did not.
The second tendency is easy to understand: playoff teams outscore non-playoff teams because scoring more lets them win more. The rise in average score across all NBA teams is consistent with the small ball revolution, in which teams speed up, create more shooting chances and raise their share of three-point shots.
Next, we use the box score data and team average data to take a deeper look at potential variables that contribute to winning games.
It is clear that three-point attempts as a percentage of all field goal attempts increased over the last 10 seasons, which corresponds to the “Small Ball Revolution” and to our finding that points per game increased over the last 10 regular seasons. In addition, teams that made the playoffs take a larger share of three-point attempts during a game, which suggests that the three-point attempt percentage might be a contributor to the number of wins.
It is also apparent that the three-point shooting rate is higher among playoff teams than non-playoff teams, since a higher shooting rate translates into more points per game. Another tendency in this plot is that the variation in three-point shooting rate narrows over time. This may reflect the attention teams have paid to three-point shooting: if players train more on shooting, their shooting should become more stable.
plot_ly(box_score_viz, x = ~ season_year, y = ~ fg3a_p, color = ~ play_off_team, type = "box") %>%
layout(boxmode = "group",
xaxis = list(title = 'Season Year'),
yaxis = list(title = "Three Field Goal Attempt Percent"))
plot_ly(box_score_viz, x = ~ season_year, y = ~ fg3p, color = ~ play_off_team, type = "box") %>%
layout(boxmode = "group",
xaxis = list(title = 'Season Year'),
yaxis = list(title = "Three Pointer Rate"))
Then we explore the influence that play types have on average wins. If a play type clearly contributes to a team’s number of wins, we would suggest that the Knicks design more of their offense around it.
The average isolations per game for playoff teams have been almost equal from the 2013-14 season until now, while those for non-playoff teams tended to decrease over time. Superstar team-ups might account for this phenomenon, because superstars are able to run more isolations. As superstars joined playoff teams, the isolations of non-playoff teams decreased.
Pick and roll is a common form of offensive teamwork. The Pick and Roll plot shows that the average pick and rolls per game tended to increase over the last 8 years, and that playoff teams ran fewer of them than non-playoff teams, which is consistent with the isolation pattern.
Transition means that the defending team immediately launches a fast break after getting a rebound or stealing the ball, without waiting for the new defense to get set. It is an important way to speed up and score easily and quickly. Average transitions rose gradually, likely because they are more efficient. Non-playoff teams seemed to run more transitions than playoff teams. That does not mean that more transitions lead to fewer wins; rather, non-playoff teams are likely weaker in set (half-court) offense and therefore tend to rely more on transitions.
play_tp_df %>%
group_by(season_year, play_off_team) %>%
summarise(iso_mean = mean(poss_iso)) %>%
mutate(text_label = str_c("Team Type: ", play_off_team,
"\nAverage Isolation: ", round(iso_mean, 2))) %>%
plot_ly(x = ~ season_year, y = ~ iso_mean, type = "bar",
color = ~ play_off_team, text = ~text_label, colors = "viridis") %>%
layout(barmode = "group",
xaxis = list(title = 'Season Year'),
yaxis = list(title = "Average Isolation"))
play_tp_df %>%
group_by(season_year, play_off_team) %>%
summarise(pr_mean = mean(poss_pr)) %>%
mutate(text_label = str_c("Team Type: ", play_off_team,
"\nAverage Pick and Roll: ", round(pr_mean, 2))) %>%
plot_ly(x = ~ season_year, y = ~ pr_mean, type = "bar",
color = ~ play_off_team, text = ~text_label, colors = "viridis") %>%
layout(barmode = "group",
xaxis = list(title = 'Season Year'),
yaxis = list(title = "Average Pick and Roll"))
play_tp_df %>%
group_by(season_year, play_off_team) %>%
summarise(trans_mean = mean(poss_trans)) %>%
mutate(text_label = str_c("Team Type: ", play_off_team,
"\nAverage Transition: ", round(trans_mean, 2))) %>%
plot_ly(x = ~ season_year, y = ~ trans_mean, type = "bar",
color = ~ play_off_team, text = ~text_label, colors = "viridis") %>%
layout(barmode = "group",
xaxis = list(title = 'Season Year'),
yaxis = list(title = "Average Transition"))
Blocks are a key defensive parameter. More blocks mean that opponents have a lower chance to score. Playoff teams did better on blocks than non-playoff teams.
Steals are also a defensive parameter, and each steal is a turnover for the opponent. There was no apparent trend in steals over time.
Too many turnovers can cost a team the game. The turnover plot shows that average turnovers for playoff teams were lower than for non-playoff teams.
Defensive rebounds prevent the opponent’s second-chance attempts and so reduce its score. As we can see, playoff teams grabbed more defensive rebounds than non-playoff teams.
The number of passes per game reflects offensive fluency. An adequate number of passes can create good shooting opportunities, but many passes without a good shooting opportunity indicate poor offensive ability. In the passes plot, non-playoff teams had higher average passes per game than playoff teams.
avg_viz_df %>%
group_by(season_year, play_off_team) %>%
summarise(blk_mean = mean(blk)) %>%
mutate(text_label = str_c("Team Type: ", play_off_team,
"\nAverage Block: ", round(blk_mean, 2))) %>%
plot_ly(x = ~ season_year, y = ~ blk_mean, type = "bar",
color = ~ play_off_team, text = ~text_label, colors = "viridis") %>%
layout(barmode = "group",
xaxis = list(title = 'Season Year'),
yaxis = list(title = "Average Block"))
avg_viz_df %>%
group_by(season_year, play_off_team) %>%
summarise(stl_mean = mean(stl)) %>%
mutate(text_label = str_c("Team Type: ", play_off_team,
"\nAverage Steal: ", round(stl_mean, 2))) %>%
plot_ly(x = ~ season_year, y = ~ stl_mean, type = "bar",
color = ~ play_off_team, text = ~text_label, colors = "viridis") %>%
layout(barmode = "group",
xaxis = list(title = 'Season Year'),
yaxis = list(title = "Average Steal"))
avg_viz_df %>%
group_by(season_year, play_off_team) %>%
summarise(tov_mean = mean(tov_avg)) %>%
mutate(text_label = str_c("Team Type: ", play_off_team,
"\nAverage Turnover: ", round(tov_mean, 2))) %>%
plot_ly(x = ~ season_year, y = ~ tov_mean, type = "bar",
color = ~ play_off_team, text = ~text_label, colors = "viridis") %>%
layout(barmode = "group",
xaxis = list(title = 'Season Year'),
yaxis = list(title = "Average Turnover"))
avg_viz_df %>%
group_by(season_year, play_off_team) %>%
summarise(dreb_mean = mean(dreb)) %>%
mutate(text_label = str_c("Team Type: ", play_off_team,
"\nAverage Defensive Rebound: ", round(dreb_mean, 2))) %>%
plot_ly(x = ~ season_year, y = ~ dreb_mean, type = "bar",
color = ~ play_off_team, text = ~text_label, colors = "viridis") %>%
layout(barmode = "group",
xaxis = list(title = 'Season Year'),
yaxis = list(title = "Average Defensive Rebound"))
avg_viz_df %>%
group_by(season_year, play_off_team) %>%
summarise(passes_mean = mean(passes_made)) %>%
mutate(text_label = str_c("Team Type: ", play_off_team,
"\nAverage Passes: ", round(passes_mean, 2))) %>%
plot_ly(x = ~ season_year, y = ~ passes_mean, type = "bar",
color = ~ play_off_team, text = ~text_label, colors = "viridis") %>%
layout(barmode = "group",
xaxis = list(title = 'Season Year'),
yaxis = list(title = "Average Passes"))
In this part, we explore each game in the past 20 years to find important variables that might affect the result of a game. This process also gives us some insight for choosing potential parameters for model building.
First, we look at the score difference between winning and losing teams over the past 20 years.
We can see that the score a team needs to win a game has become much higher than in past years. The average score of the winning team went up and down from the 2001-02 to the 2014-15 seasons; however, after the start of the small ball revolution, it has kept increasing and has not fallen since the 2015-16 season.
So a team that wants to win games clearly needs to find new ways to score more. Next, we explore some factors that we think might play a role in the result of a game.
lose_game=
box_score_all %>%
filter(wl == "L")
box_score_all %>%
filter(wl == "W") %>%
ggplot(aes(x = pts, y = season_year)) +
geom_density_ridges(scale = .8, alpha = .5, fill = "blue",
quantile_lines = T, quantile_fun = mean) +
geom_density_ridges(data = lose_game, aes(x = pts, y = season_year),
scale = .8, alpha = .5, fill = "salmon",
quantile_lines = T, quantile_fun = mean) +
scale_fill_manual(name = "Team", values = cols) +
xlim(65, 140) +
labs(x = "Scores",
y = "Season Year",
title = "Score Distribution Between Winning and Losing Teams")

As a team needs to score more to win, we start the analysis of game results with variables that directly influence the score. We show the plots of percentage and attempts together, so we can observe the trend in NBA scoring strategy over these 20 years.
First, although field goal attempts and percentage do not seem to have changed much over these 20 years, winning teams have a much higher field goal percentage than losing teams. Losing teams also have slightly more field goal attempts than winning teams.
Second, three-point field goal percentage did not change much in these two decades. However, there is a substantial increase in three-point attempts. After the small ball era began in 2015, three-point attempts grew remarkably. As with field goal attempts, losing teams also have more three-point attempts.
Third, free throw attempts and percentage did not change much over these 20 years. Winning teams have both more attempts and a higher percentage.
From these variables, we conclude that after 2015 games have become something of a three-point shootout, with every team taking as many threes as it can. The most notable difference in attempts between winning and losing teams is in free throw attempts, which means that even though a free throw contributes only one point, it is still an indicator of the game result. The most notable difference in percentage between winning and losing teams is in field goal percentage, which suggests that improving field goal percentage is one of the most critical things a team should consider if it wants to win.
Since heavy three-point shooting has become a trend in every game, we looked up the three highest outliers in three-point attempts on the plot. All three belong to the Houston Rockets. Looking deeper, we found that of the ten highest single-game three-point attempt totals in these 20 years, the Houston Rockets account for 8 and the Atlanta Hawks for the other 2.
plot_ly( box_score_all, x = ~ season_year, y = ~ fga , color = ~ wl, type = "box") %>%
layout(boxmode = "group",
xaxis = list(title = 'Season Year'),
yaxis = list(title = "Field Goal Attempted"))
plot_ly( box_score_all, x = ~ season_year, y = ~ fg_pct, color = ~ wl, type = "box") %>%
layout(boxmode = "group",
xaxis = list(title = 'Season Year'),
yaxis = list(title = "Field Goal Percentage"))
plot_ly(box_score_all, x = ~ season_year, y = ~ fg3a, color = ~ wl, type = "box") %>%
layout(boxmode = "group",
xaxis = list(title = 'Season Year'),
yaxis = list(title = "3 Point Field Goals Attempted"))
plot_ly(box_score_all, x = ~ season_year, y = ~ fg3_pct, color = ~ wl, type = "box") %>%
layout(boxmode = "group",
xaxis = list(title = 'Season Year'),
yaxis = list(title = "3 Point Field Goals Percentage"))
plot_ly(box_score_all, x = ~ season_year, y = ~ fta, color = ~ wl, type = "box") %>%
layout(boxmode = "group",
xaxis = list(title = 'Season Year'),
yaxis = list(title = "Free Throw Attempted"))
plot_ly(box_score_all, x = ~ season_year, y = ~ ft_pct, color = ~ wl, type = "box") %>%
layout(boxmode = "group",
xaxis = list(title = 'Season Year'),
yaxis = list(title = "Free Throw Percent"))
nyk=
box_score_all %>%
filter(team_abbreviation =="NYK")
plot_ly(nyk, x = ~ season_year, y = ~ fg3a, color = ~ wl, type = "box") %>%
layout(boxmode = "group",
xaxis = list(title = 'Season Year'),
yaxis = list(title = "3 Point Field Goals Attempted"))
plot_ly(nyk, x = ~ season_year, y = ~ fg3_pct, color = ~ wl, type = "box") %>%
layout(boxmode = "group",
xaxis = list(title = 'Season Year'),
yaxis = list(title = "3 Point Field Goals Percentage"))
Next, we explore the influence of some offensive strategies on the court, to see which techniques might play a role in the result of a game.
Average offensive rebounds are slightly higher for losing teams, while average assists per game are noticeably higher for winning teams.
The higher offensive rebounds of losing teams follow the same pattern as their higher field goal attempts. We hypothesize that a team that is trailing adopts a more aggressive strategy than a team that is leading.
The higher assists per game of winning teams may result from how an assist is defined: it is only credited on a pass that leads to a made basket, and more made baskets mean a better chance to win.
offensive_df %>%
group_by(season_year, wl) %>%
summarise(oreb= mean(oreb)) %>%
plot_ly(x = ~ season_year, y = ~ oreb, type = "bar",
color = ~wl, colors = "viridis") %>%
layout(barmode = "group",
xaxis = list(title = 'Season Year'),
yaxis = list(title = "Average offensive rebounds of each game"))
offensive_df %>%
group_by(season_year, wl) %>%
summarise(ast= mean(ast)) %>%
plot_ly(x = ~ season_year, y = ~ ast, type = "bar",
color = ~wl, colors = "viridis") %>%
layout(barmode = "group",
xaxis = list(title = 'Season Year'),
yaxis = list(title = "Average assists of each game"))
In this part, we explore the influence of some defensive strategies on the court, to see which defensive techniques might play a role in the result of a game.
Steals, blocks and defensive rebounds per game are clearly higher for winning teams, while personal fouls and turnovers per game are slightly higher for losing teams.
box_score_all %>%
group_by(season_year, wl) %>%
summarise(stl= mean(stl)) %>%
plot_ly(x = ~ season_year, y = ~ stl, type = "bar",
color = ~wl, colors = "viridis") %>%
layout(barmode = "group",
xaxis = list(title = 'Season Year'),
yaxis = list(title = "Average steals of each game"))
box_score_all %>%
group_by(season_year, wl) %>%
summarise(blk= mean(blk)) %>%
plot_ly(x = ~ season_year, y = ~ blk, type = "bar",
color = ~wl, colors = "viridis") %>%
layout(barmode = "group",
xaxis = list(title = 'Season Year'),
yaxis = list(title = "Average blocks of each game"))
box_score_all %>%
group_by(season_year, wl) %>%
summarise(dreb= mean(dreb)) %>%
plot_ly(x = ~ season_year, y = ~ dreb, type = "bar",
color = ~wl, colors = "viridis") %>%
layout(barmode = "group",
xaxis = list(title = 'Season Year'),
yaxis = list(title = "Average defensive rebounds of each game"))
box_score_all %>%
group_by(season_year, wl) %>%
summarise(tov= mean(tov)) %>%
plot_ly(x = ~ season_year, y = ~ tov, type = "bar",
color = ~wl, colors = "viridis") %>%
layout(barmode = "group",
xaxis = list(title = 'Season Year'),
yaxis = list(title = "Average turnovers of each game"))
box_score_all %>%
group_by(season_year, wl) %>%
summarise(pf= mean(pf)) %>%
plot_ly(x = ~ season_year, y = ~ pf, type = "bar",
color = ~wl, colors = "viridis") %>%
layout(barmode = "group",
xaxis = list(title = 'Season Year'),
yaxis = list(title = "Average personal fouls of each game"))
In this part, we are going to use linear model to quantify the relationship between the number of wins and average performance parameters. Further, to know the factors influencing the result of a single game, we also use logistic regression to fit the box score data.
predict_df %>%
select(-fg3a_total, -fg3m_total, -play_off_team, -conf_rank) %>%
ggpairs(columns = 4:16)

pts_avg is moderately correlated with ast_avg, fg3_p, dreb and poss_trans. Overall, the correlations between predictors are not very high, which helps limit collinearity in the regression.
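One way to back up the visual read of the pairs plot is to list the predictor pairs whose absolute correlation exceeds a cutoff. The sketch below is illustrative only: the built-in mtcars data stands in for predict_df so the example runs on its own, and the 0.8 cutoff is an assumed value, not one used in this analysis.

```r
# Stand-in data: mtcars replaces predict_df so the example is self-contained.
predictors <- mtcars[, c("disp", "hp", "drat", "wt", "qsec")]

# Pairwise Pearson correlations between the candidate predictors.
cor_mat <- cor(predictors)

# Keep each pair once (upper triangle) and flag those above the cutoff.
cutoff <- 0.8  # assumed threshold, chosen only for illustration
idx <- which(upper.tri(cor_mat) & abs(cor_mat) > cutoff, arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)
print(high_pairs)
```

An empty result would support the statement that no pair of predictors is strongly correlated. Pairs that do show up are candidates for dropping one of the two variables.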
We used backward elimination to select the predictors.
First, we put all candidate variables into the linear model and examine the regression results.
model1 = lm(data = predict_df, wins_revised ~ pts_avg + tov_avg + fg3_p + fg3_r + stl + blk + dreb + poss_trans + poss_iso + poss_pr + ast_avg + passes_made)
summary(model1)
##
## Call:
## lm(formula = wins_revised ~ pts_avg + tov_avg + fg3_p + fg3_r +
## stl + blk + dreb + poss_trans + poss_iso + poss_pr + ast_avg +
## passes_made, data = predict_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.0773 -5.2861 -0.3047 4.9906 19.5613
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -175.99049 25.58210 -6.879 5.78e-11 ***
## pts_avg 0.42485 0.23003 1.847 0.066055 .
## tov_avg -1.94122 0.57824 -3.357 0.000923 ***
## fg3_p -20.42731 11.77745 -1.734 0.084197 .
## fg3_r 279.42950 37.76560 7.399 2.64e-12 ***
## stl 5.83534 0.85794 6.802 9.07e-11 ***
## blk 1.06911 0.81417 1.313 0.190466
## dreb 3.31563 0.45854 7.231 7.27e-12 ***
## poss_trans -1.18011 0.30874 -3.822 0.000171 ***
## poss_iso 0.36546 0.32999 1.107 0.269258
## poss_pr -0.78945 0.18741 -4.212 3.64e-05 ***
## ast_avg -0.51355 0.47089 -1.091 0.276608
## passes_made -0.02401 0.03490 -0.688 0.492183
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.023 on 227 degrees of freedom
## Multiple R-squared: 0.598, Adjusted R-squared: 0.5767
## F-statistic: 28.14 on 12 and 227 DF, p-value: < 2.2e-16
anova(model1)
## Analysis of Variance Table
##
## Response: wins_revised
## Df Sum Sq Mean Sq F value Pr(>F)
## pts_avg 1 6062.9 6062.9 94.1910 < 2.2e-16 ***
## tov_avg 1 1845.1 1845.1 28.6643 2.104e-07 ***
## fg3_p 1 635.3 635.3 9.8694 0.001904 **
## fg3_r 1 5819.1 5819.1 90.4038 < 2.2e-16 ***
## stl 1 1028.0 1028.0 15.9708 8.698e-05 ***
## blk 1 1244.2 1244.2 19.3296 1.690e-05 ***
## dreb 1 2544.2 2544.2 39.5263 1.641e-09 ***
## poss_trans 1 306.3 306.3 4.7581 0.030188 *
## poss_iso 1 1022.1 1022.1 15.8790 9.101e-05 ***
## poss_pr 1 1101.4 1101.4 17.1117 4.965e-05 ***
## ast_avg 1 95.0 95.0 1.4760 0.225662
## passes_made 1 30.5 30.5 0.4733 0.492183
## Residuals 227 14611.5 64.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The adjusted R-squared of the full model is 0.5767; that is, roughly 58% of the variance in the response is explained by the predictors once the number of predictors is penalized.
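As a quick check, the adjusted R-squared can be recomputed from the numbers printed by summary(model1), using the standard formula 1 - (1 - R^2)(n - 1)/(n - p - 1). The sample size n is not printed directly; it is inferred here from the residual degrees of freedom.

```r
# Values read from the summary(model1) output above.
r2       <- 0.598   # multiple R-squared (already rounded in the printout)
p        <- 12      # number of predictors
df_resid <- 227     # residual degrees of freedom
n        <- df_resid + p + 1   # inferred number of observations

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)
```

Because the penalty grows with p, dropping a weak predictor can raise the adjusted R-squared even though the plain R-squared always falls or stays the same.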
Next, to look for a model with a higher adjusted R-squared, we drop the predictor with the highest p-value, which is passes_made.
model2 = lm(data = predict_df, wins_revised ~ pts_avg + tov_avg + fg3_p + fg3_r + stl + blk + dreb + poss_trans + poss_iso + poss_pr + ast_avg)
summary(model2)
##
## Call:
## lm(formula = wins_revised ~ pts_avg + tov_avg + fg3_p + fg3_r +
## stl + blk + dreb + poss_trans + poss_iso + poss_pr + ast_avg,
## data = predict_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.4674 -5.2588 -0.2532 4.8497 19.7248
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -185.2087 21.7670 -8.509 2.36e-15 ***
## pts_avg 0.4602 0.2239 2.055 0.040997 *
## tov_avg -2.0249 0.5646 -3.586 0.000410 ***
## fg3_p -22.0535 11.5245 -1.914 0.056921 .
## fg3_r 278.0997 37.6725 7.382 2.90e-12 ***
## stl 5.8154 0.8565 6.790 9.61e-11 ***
## blk 1.0002 0.8071 1.239 0.216493
## dreb 3.3142 0.4580 7.236 6.97e-12 ***
## poss_trans -1.1359 0.3016 -3.766 0.000211 ***
## poss_iso 0.4524 0.3044 1.486 0.138627
## poss_pr -0.7539 0.1799 -4.190 3.99e-05 ***
## ast_avg -0.5649 0.4644 -1.216 0.225122
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.014 on 228 degrees of freedom
## Multiple R-squared: 0.5971, Adjusted R-squared: 0.5777
## F-statistic: 30.72 on 11 and 228 DF, p-value: < 2.2e-16
anova(model2)
## Analysis of Variance Table
##
## Response: wins_revised
## Df Sum Sq Mean Sq F value Pr(>F)
## pts_avg 1 6062.9 6062.9 94.4091 < 2.2e-16 ***
## tov_avg 1 1845.1 1845.1 28.7307 2.034e-07 ***
## fg3_p 1 635.3 635.3 9.8922 0.001881 **
## fg3_r 1 5819.1 5819.1 90.6131 < 2.2e-16 ***
## stl 1 1028.0 1028.0 16.0078 8.530e-05 ***
## blk 1 1244.2 1244.2 19.3744 1.651e-05 ***
## dreb 1 2544.2 2544.2 39.6179 1.566e-09 ***
## poss_trans 1 306.3 306.3 4.7691 0.029995 *
## poss_iso 1 1022.1 1022.1 15.9157 8.926e-05 ***
## poss_pr 1 1101.4 1101.4 17.1513 4.863e-05 ***
## ast_avg 1 95.0 95.0 1.4794 0.225122
## Residuals 228 14641.9 64.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The adjusted R-squared improves slightly to 0.5777. Next, we drop ast_avg, which has the highest p-value among the remaining variables, to see whether the adjusted R-squared improves further.
model3 = lm(data = predict_df, wins_revised ~ pts_avg + tov_avg + fg3_p + fg3_r + stl + blk + dreb + poss_trans + poss_iso + poss_pr)
summary(model3)
##
## Call:
## lm(formula = wins_revised ~ pts_avg + tov_avg + fg3_p + fg3_r +
## stl + blk + dreb + poss_trans + poss_iso + poss_pr, data = predict_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.6211 -5.1850 -0.0912 5.1496 18.9506
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -185.1802 21.7898 -8.498 2.48e-15 ***
## pts_avg 0.3252 0.1947 1.670 0.096204 .
## tov_avg -2.0556 0.5647 -3.640 0.000337 ***
## fg3_p -22.3064 11.5347 -1.934 0.054363 .
## fg3_r 270.4524 37.1830 7.274 5.52e-12 ***
## stl 5.6946 0.8516 6.687 1.72e-10 ***
## blk 0.9850 0.8078 1.219 0.223953
## dreb 3.3423 0.4579 7.299 4.72e-12 ***
## poss_trans -1.1494 0.3017 -3.809 0.000179 ***
## poss_iso 0.6971 0.2288 3.047 0.002583 **
## poss_pr -0.6436 0.1556 -4.137 4.94e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.022 on 229 degrees of freedom
## Multiple R-squared: 0.5945, Adjusted R-squared: 0.5768
## F-statistic: 33.58 on 10 and 229 DF, p-value: < 2.2e-16
anova(model3)
## Analysis of Variance Table
##
## Response: wins_revised
## Df Sum Sq Mean Sq F value Pr(>F)
## pts_avg 1 6062.9 6062.9 94.2119 < 2.2e-16 ***
## tov_avg 1 1845.1 1845.1 28.6707 2.083e-07 ***
## fg3_p 1 635.3 635.3 9.8716 0.00190 **
## fg3_r 1 5819.1 5819.1 90.4238 < 2.2e-16 ***
## stl 1 1028.0 1028.0 15.9743 8.661e-05 ***
## blk 1 1244.2 1244.2 19.3339 1.681e-05 ***
## dreb 1 2544.2 2544.2 39.5351 1.614e-09 ***
## poss_trans 1 306.3 306.3 4.7592 0.03016 *
## poss_iso 1 1022.1 1022.1 15.8825 9.062e-05 ***
## poss_pr 1 1101.4 1101.4 17.1155 4.942e-05 ***
## Residuals 229 14736.9 64.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The adjusted R-squared decreases to 0.5768, so ast_avg is kept in the model. Model 2 is the final model.
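The manual procedure above (drop the predictor with the highest p-value, keep the smaller model only if the adjusted R-squared improves) can be written as a short loop. This is a minimal sketch under stated assumptions: mtcars stands in for predict_df, mpg plays the role of wins_revised, and the helper adj_r2() is a hypothetical name introduced here for illustration.

```r
# Stand-in data: mtcars replaces predict_df; mpg plays the role of wins_revised.
dat      <- mtcars
response <- "mpg"
current  <- c("cyl", "disp", "hp", "drat", "wt", "qsec")

# Hypothetical helper: adjusted R-squared of the model using `vars`.
adj_r2 <- function(vars) {
  fit <- lm(reformulate(vars, response = response), data = dat)
  summary(fit)$adj.r.squared
}

best <- adj_r2(current)
repeat {
  if (length(current) == 1) break
  fit <- lm(reformulate(current, response = response), data = dat)
  # p-values of the predictors (the intercept row is dropped).
  pvals     <- summary(fit)$coefficients[-1, "Pr(>|t|)"]
  candidate <- setdiff(current, names(which.max(pvals)))
  new       <- adj_r2(candidate)
  if (new <= best) break   # removal did not improve adjusted R-squared: stop
  current <- candidate
  best    <- new
}
current
best
```

Note that base R's step() automates a similar search but ranks models by AIC rather than by adjusted R-squared, so it may not reproduce the same selection.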
We also want to know which of the three models generalizes best, so in this section we use cross-validation to compare the candidate models.
set.seed(1000)
predict_cv_df =
predict_df %>%
crossv_mc(100) %>%
mutate(train = map(train, as_tibble),
test = map(test, as_tibble))
predict_cv_df =
predict_cv_df %>%
mutate(model1 = map(train, ~lm(wins_revised ~ pts_avg + ast_avg + tov_avg + fg3_p + fg3_r + stl + blk + dreb + poss_trans + passes_made + poss_iso + poss_pr, data = .x)),
model2 = map(train, ~lm(wins_revised ~ pts_avg + ast_avg + tov_avg + fg3_p + fg3_r + stl + blk + dreb + poss_trans + poss_iso + poss_pr, data = .x)),
model3 = map(train, ~lm(wins_revised ~ pts_avg + tov_avg + fg3_p + fg3_r + stl + blk + dreb + poss_trans + poss_iso + poss_pr, data = .x))) %>%
mutate(rmse1 = map2_dbl(model1, test, ~rmse(model = .x, data = .y)),
rmse2 = map2_dbl(model2, test, ~rmse(model = .x, data = .y)),
rmse3 = map2_dbl(model3, test, ~rmse(model = .x, data = .y)))
predict_cv_df %>%
select(starts_with("rmse")) %>%
pivot_longer(everything(),
names_to = "model",
names_prefix = "rmse",
values_to = "rmse") %>%
ggplot(aes(x = model, y = rmse, fill = model)) +
geom_boxplot(alpha = .6)
The RMSE distributions of the three models are very similar, which indicates a similar level of generalizability. We therefore keep model 2 as the final model.
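To compare the models numerically as well as by boxplot, the mean and spread of the RMSE across splits can be tabulated. The sketch below reimplements Monte Carlo cross-validation in base R so that it runs without extra packages. It assumes a 20% test split, uses mtcars as stand-in data, and the two formulas are illustrative rather than the models fitted above.

```r
set.seed(1000)
dat <- mtcars  # stand-in for predict_df

# Two illustrative candidate models (not the models from this report).
formulas <- list(
  full    = mpg ~ wt + hp + qsec + drat,
  reduced = mpg ~ wt + hp
)

# Root mean squared prediction error of a fitted model on new data.
rmse <- function(fit, newdata) {
  resp <- all.vars(formula(fit))[1]
  sqrt(mean((newdata[[resp]] - predict(fit, newdata))^2))
}

# 100 random train/test splits; each row holds one split's RMSE per model.
n_splits <- 100
results <- t(replicate(n_splits, {
  test_idx <- sample(nrow(dat), size = round(0.2 * nrow(dat)))
  train <- dat[-test_idx, ]
  test  <- dat[test_idx, ]
  sapply(formulas, function(f) rmse(lm(f, data = train), test))
}))

colMeans(results)        # average test RMSE per model
apply(results, 2, sd)    # spread across splits
```

If the difference in mean RMSE between models is small relative to the spread across splits, the models can be treated as equally good at prediction, and the simpler or better-motivated one can be preferred.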
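In the Residuals vs Fitted plot, the residuals are scattered roughly symmetrically around 0, and the Q-Q plot suggests they are approximately normal. The Scale-Location plot shows no clear pattern, so heteroscedasticity does not appear to be a problem, and the Cook's distance plot shows no outlier with a large influence on the model fit.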
par(mfrow=c(2,2))
plot(model2, which = 1)
plot(model2, which = 2)
plot(model2, which = 3)
plot(model2, which = 4)
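The visual diagnostics can be complemented with simple numeric checks. The sketch below is an illustration on stand-in data (a small model on mtcars in place of model2). The 4/n cutoff for Cook's distance is a common rule of thumb and an assumption here, not a threshold used in this analysis.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)  # stand-in for model2

# Normality of residuals (complements the Q-Q plot).
sw <- shapiro.test(residuals(fit))
sw$p.value

# Influence: flag observations whose Cook's distance exceeds 4/n.
cd <- cooks.distance(fit)
influential <- names(cd)[cd > 4 / length(cd)]
influential
```

A large Shapiro-Wilk p-value is consistent with approximately normal residuals, and an empty `influential` vector is consistent with the reading that no single observation dominates the fit.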

The final model is model 2.
model2 %>% broom::tidy()
## # A tibble: 12 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -185. 21.8 -8.51 2.36e-15
## 2 pts_avg 0.460 0.224 2.06 4.10e- 2
## 3 tov_avg -2.02 0.565 -3.59 4.10e- 4
## 4 fg3_p -22.1 11.5 -1.91 5.69e- 2
## 5 fg3_r 278. 37.7 7.38 2.90e-12
## 6 stl 5.82 0.856 6.79 9.61e-11
## 7 blk 1.00 0.807 1.24 2.16e- 1
## 8 dreb 3.31 0.458 7.24 6.97e-12
## 9 poss_trans -1.14 0.302 -3.77 2.11e- 4
## 10 poss_iso 0.452 0.304 1.49 1.39e- 1
## 11 poss_pr -0.754 0.180 -4.19 3.99e- 5
## 12 ast_avg -0.565 0.464 -1.22 2.25e- 1
Not every variable kept in the model is individually significant at the 0.05 level: fg3_p, blk, poss_iso and ast_avg have p-values above 0.05, so their estimated effects below should be read with caution. Each interpretation holds the other predictors fixed.
For every 2 additional points scored per game, the model predicts about 1 extra win.
For each additional turnover per game, the model predicts about 2 extra losses, which hurts badly.
For each 5 percentage point increase in the proportion of three-point attempts, the model predicts about 1 extra loss.
For each 1 percentage point increase in three-point shooting percentage, the model predicts about 2.8 extra wins, which helps the team a lot.
For each additional 0.5 steals per game, the model predicts about 3 extra wins. However, teams average fewer than 10 steals per game, so this is not easy to improve.
For each additional block per game, the model predicts about 1 extra win.
For each additional defensive rebound per game, the model predicts about 3.31 extra wins, so securing the defensive rebound is critical.
For each additional transition possession per game, the model predicts about 1.14 extra losses.
For each additional isolation possession per game, the model predicts about 0.45 extra wins.
For each additional pick-and-roll possession per game, the model predicts about 0.75 extra losses.
For each additional assist per game, the model predicts about 0.56 extra losses.
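The arithmetic behind these statements is the coefficient multiplied by the change in the predictor. The sketch below reproduces a few of them from the model 2 estimates. It assumes fg3_p and fg3_r are stored as proportions between 0 and 1, which is what the interpretations above imply; the `delta` values are the illustrative changes used in the text.

```r
# Coefficients copied from the model2 output above.
coefs <- c(pts_avg = 0.4602, tov_avg = -2.0249, fg3_p = -22.0535,
           fg3_r = 278.0997, stl = 5.8154)

# Changes in each predictor, matching the statements in the text.
# fg3_p and fg3_r are assumed to be proportions, so 0.05 means 5 points.
delta <- c(pts_avg = 2, tov_avg = 1, fg3_p = 0.05, fg3_r = 0.01, stl = 0.5)

# Predicted change in season wins, holding the other predictors fixed.
effect <- coefs * delta[names(coefs)]
round(effect, 2)
```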
Based on this season's box scores, we use model 2 to predict the number of wins for all 30 teams. Sorting by predicted wins, the Knicks are predicted to win 43.6 games and rank eighth in the Eastern Conference in the 2021-22 season. Since eighth is the last playoff position, the Knicks need to improve their performance and win more games to secure a place in the playoffs.
top8_east =
prediction_21_22 %>%
head(8) %>%
left_join(new_season_df, by = c("season_year", "team_abbreviation", "conference")) %>%
group_by(season_year) %>%
mutate(ranking = row_number())
top8_east %>%
select(season_year, team_abbreviation, conference, ranking) %>%
knitr::kable("simple")
| season_year | team_abbreviation | conference | ranking |
|---|---|---|---|
| 2021-22 | CHA | east | 1 |
| 2021-22 | MIL | east | 2 |
| 2021-22 | BKN | east | 3 |
| 2021-22 | PHI | east | 4 |
| 2021-22 | BOS | east | 5 |
| 2021-22 | ATL | east | 6 |
| 2021-22 | CHI | east | 7 |
| 2021-22 | NYK | east | 8 |
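The predict-then-rank step behind this table can be sketched as follows. The code that builds prediction_21_22 is not shown in this section, so this is only an illustration under assumptions: mtcars stands in for the team data, mpg plays the role of wins, and the train/new-season split is hypothetical.

```r
# Stand-in data: mtcars replaces the team data; mpg plays the role of wins.
train      <- mtcars[1:24, ]
new_season <- mtcars[25:32, ]   # hypothetical "new season" rows

fit <- lm(mpg ~ wt + hp, data = train)

# Predict, sort from highest to lowest, and attach a rank.
pred <- data.frame(
  team      = rownames(new_season),
  predicted = predict(fit, newdata = new_season)
)
pred <- pred[order(pred$predicted, decreasing = TRUE), ]
pred$ranking <- seq_len(nrow(pred))
head(pred, 8)
```

Each prediction is a point estimate. With a residual standard error of about 8 wins in model 2, neighbouring ranks are far from certain; calling predict() with interval = "prediction" returns a range for each team.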
Let’s see what the Knicks can do to improve their performance and make the playoffs.
From the plots comparing the top 8 Eastern Conference teams, we conclude that the Knicks need to improve in turnovers and three-point shooting. As the model shows, the number of turnovers is negatively associated with the number of wins, yet the Knicks currently have the second-highest number of turnovers per game. In addition, their high proportion of three-point attempts, combined with a low three-point percentage relative to the other top 8 teams in the East, holds back their predicted result.
top8_east %>%
ggplot(aes(x = reorder(team_abbreviation, tov_avg), y = tov_avg, fill = team_abbreviation)) +
geom_bar(stat="identity") +
labs(
x = "Team",
y = "Average Turnover per Game",
title = "Top 8 Team Average Turnover (East)"
)

top8_east %>%
ggplot(aes(x = reorder(team_abbreviation, fg3_p), y = fg3_p, fill = team_abbreviation)) +
geom_bar(stat="identity") +
labs(
x = "Team",
y = "Proportion of Three-Point Attempts",
title = "Top 8 Team Three Field Goal Attempt (East)"
)

top8_east %>%
ggplot(aes(x = reorder(team_abbreviation, fg3_r), y = fg3_r, fill = team_abbreviation)) +
geom_bar(stat="identity") +
labs(
x = "Team",
y = "Three Pointer Rate",
title = "Top 8 Team Three Pointer Rate (East)"
)

top8_east %>%
ggplot(aes(x = reorder(team_abbreviation, stl), y = stl, fill = team_abbreviation)) +
geom_bar(stat="identity") +
labs(
x = "Team",
y = "Average Steal per Game"
)

top8_east %>%
ggplot(aes(x = reorder(team_abbreviation, dreb), y = dreb, fill = team_abbreviation)) +
geom_bar(stat="identity") +
labs(
x = "Team",
y = "Average Defensive Rebound"
)

top8_east %>%
ggplot(aes(x = reorder(team_abbreviation, poss_iso), y = poss_iso, fill = team_abbreviation)) +
geom_bar(stat="identity") +
labs(x = "Team",
y = "Average Isolation")

top8_east %>%
ggplot(aes(x = reorder(team_abbreviation, poss_pr), y = poss_pr, fill = team_abbreviation)) +
geom_bar(stat="identity") +
labs(x = "Team",
y = "Average Pick and Roll")

Because three-point shooting is such a crucial factor for an NBA team trying to make the playoffs, we look at the Knicks’ three-point shooting data to offer more specific suggestions. In this section, we compare the Knicks’ overall performance to the league average and then plot shot charts for some of the team’s three-point leaders.
In this part, we referred to the ballr package and Owen’s blog to make the hex plots.
library(prismatic)
library(extrafont)
library(cowplot)
p = plot_court(court_themes$light) +
geom_polygon(
data = df,
aes(
x = adj_x,
y = adj_y,
group = hexbin_id,
fill = league_avg_diff,
color = after_scale(clr_darken(fill, .333))),
size = .25) +
scale_x_continuous(limits = c(-27.5, 27.5)) +
scale_y_continuous(limits = c(0, 45)) +
scale_fill_distiller(direction = -1,
palette = "PuOr",
limits = c(-.15, .15),
breaks = seq(-.15, .15, .03),
labels = c("-15%", "-12%", "-9%", "-6%", "-3%", "0%", "+3%", "+6%", "+9%", "+12%", "+15%"),
"3FG Percentage Points vs. League Average") +
guides(fill = guide_legend(
label.position = 'bottom',
title.position = 'top',
keywidth = .45,
keyheight = .15,
default.unit = "inch",
title.hjust = .5,
title.vjust = 0,
label.vjust = 3,
nrow = 1)) +
theme(legend.spacing.x = unit(0, 'cm'),
legend.title = element_text(size = 9),
legend.text = element_text(size = 8),
legend.margin = margin(-10,0,-1,0),
legend.position = 'bottom',
legend.box.margin = margin(-30,0,15,0),
plot.title = element_text(hjust = 0.5, vjust = -1, size = 15),
plot.subtitle = element_text(hjust = 0.5, size = 8, vjust = -.5),
plot.caption = element_text(face = "italic", size = 8),
plot.margin = margin(0, -5, 0, -5, "cm")) +
labs(title = "New York Knicks - Three Point",
subtitle = "2021-22 Regular Season")
ggdraw(p) +
theme(plot.background = element_rect(fill="floralwhite", color = NA))

According to this hex plot, in the 2021-22 regular season the Knicks shoot three-pointers better than the league average from both wings and about the same as the league average from the top of the key, but worse than the league average from both corners. More specifically, their three-point percentage is 6 percentage points above the league average on the right wing and 3 percentage points above it on the left wing, while it is 6 and 3 percentage points below the league average in the right and left corners respectively. Therefore, the Knicks should run more three-point plays for the left and right wings, and corner shooting should be strengthened through training.
To further understand which tactics the Knicks can deploy, we look at the shot logs of the team’s three-point leaders, Alec Burks, Kemba Walker and Derrick Rose, who have the highest three-point percentages on the Knicks.
alec_p =
plot_court(court_themes$light) +
geom_polygon(
data = alec_df,
aes(
x = hex_data.adj_x,
y = hex_data.adj_y,
group = hex_data.hexbin_id,
fill = hex_data.league_avg_diff,
color = after_scale(clr_darken(fill, .333))),
size = .25) +
scale_x_continuous(limits = c(-27.5, 27.5)) +
scale_y_continuous(limits = c(0, 45)) +
scale_fill_distiller(direction = -1,
palette = "PuOr",
limits = c(-.15, .15),
breaks = seq(-.15, .15, .03),
labels = c("-15%", "-12%", "-9%", "-6%", "-3%", "0%", "+3%", "+6%", "+9%", "+12%", "+15%"),
"3FG Percentage Points vs. League Average") +
guides(fill = guide_legend(
label.position = 'bottom',
title.position = 'top',
keywidth = .45,
keyheight = .15,
default.unit = "inch",
title.hjust = .5,
title.vjust = 0,
label.vjust = 3,
nrow = 1)) +
theme(legend.spacing.x = unit(0, 'cm'),
legend.title = element_text(size = 9),
legend.text = element_text(size = 8),
legend.margin = margin(-10,0,-1,0),
legend.position = 'bottom',
legend.box.margin = margin(-30,0,15,0),
plot.title = element_text(hjust = 0.5, vjust = -1, size = 15),
plot.subtitle = element_text(hjust = 0.5, size = 8, vjust = -.5),
plot.caption = element_text(face = "italic", size = 8),
plot.margin = margin(0, -5, 0, -5, "cm")) +
labs(title = "Alec Burks - Three Point",
subtitle = "2021-22 Regular Season")
ggdraw(alec_p) +
theme(plot.background = element_rect(fill="floralwhite", color = NA))

walker_p =
plot_court(court_themes$light) +
geom_polygon(
data = walker_df,
aes(
x = hex_data.adj_x,
y = hex_data.adj_y,
group = hex_data.hexbin_id,
fill = hex_data.league_avg_diff,
color = after_scale(clr_darken(fill, .333))),
size = .25) +
scale_x_continuous(limits = c(-27.5, 27.5)) +
scale_y_continuous(limits = c(0, 45)) +
scale_fill_distiller(direction = -1,
palette = "PuOr",
limits = c(-.15, .15),
breaks = seq(-.15, .15, .03),
labels = c("-15%", "-12%", "-9%", "-6%", "-3%", "0%", "+3%", "+6%", "+9%", "+12%", "+15%"),
"3FG Percentage Points vs. League Average") +
guides(fill = guide_legend(
label.position = 'bottom',
title.position = 'top',
keywidth = .45,
keyheight = .15,
default.unit = "inch",
title.hjust = .5,
title.vjust = 0,
label.vjust = 3,
nrow = 1)) +
theme(legend.spacing.x = unit(0, 'cm'),
legend.title = element_text(size = 9),
legend.text = element_text(size = 8),
legend.margin = margin(-10,0,-1,0),
legend.position = 'bottom',
legend.box.margin = margin(-30,0,15,0),
plot.title = element_text(hjust = 0.5, vjust = -1, size = 15),
plot.subtitle = element_text(hjust = 0.5, size = 8, vjust = -.5),
plot.caption = element_text(face = "italic", size = 8),
plot.margin = margin(0, -5, 0, -5, "cm")) +
labs(title = "Kemba Walker - Three Point",
subtitle = "2021-22 Regular Season")
ggdraw(walker_p) +
theme(plot.background = element_rect(fill="floralwhite", color = NA))

rose_p =
plot_court(court_themes$light) +
geom_polygon(
data = rose_df,
aes(
x = hex_data.adj_x,
y = hex_data.adj_y,
group = hex_data.hexbin_id,
fill = hex_data.league_avg_diff,
color = after_scale(clr_darken(fill, .333))),
size = .25) +
scale_x_continuous(limits = c(-27.5, 27.5)) +
scale_y_continuous(limits = c(0, 45)) +
scale_fill_distiller(direction = -1,
palette = "PuOr",
limits = c(-.15, .15),
breaks = seq(-.15, .15, .03),
labels = c("-15%", "-12%", "-9%", "-6%", "-3%", "0%", "+3%", "+6%", "+9%", "+12%", "+15%"),
"3FG Percentage Points vs. League Average") +
guides(fill = guide_legend(
label.position = 'bottom',
title.position = 'top',
keywidth = .45,
keyheight = .15,
default.unit = "inch",
title.hjust = .5,
title.vjust = 0,
label.vjust = 3,
nrow = 1)) +
theme(legend.spacing.x = unit(0, 'cm'),
legend.title = element_text(size = 9),
legend.text = element_text(size = 8),
legend.margin = margin(-10,0,-1,0),
legend.position = 'bottom',
legend.box.margin = margin(-30,0,15,0),
plot.title = element_text(hjust = 0.5, vjust = -1, size = 15),
plot.subtitle = element_text(hjust = 0.5, size = 8, vjust = -.5),
plot.caption = element_text(face = "italic", size = 8),
plot.margin = margin(0, -5, 0, -5, "cm")) +
labs(title = "Derrick Rose - Three Point",
subtitle = "2021-22 Regular Season")
ggdraw(rose_p) +
theme(plot.background = element_rect(fill="floralwhite", color = NA))

The three-point performance of the Knicks’ leading shooters is consistent with that of the whole team. In the 2021-22 season, none of them shoots better than the league average from both corners. However, they are more likely to make three-pointers from both wings. We therefore think the coach should design more plays for these players on the wings, and the players should take fewer corner attempts in games while getting more shooting practice from that area.
Through the data analysis and modeling, we identified features that contribute to the number of wins in a season, game results and play scores. We think fans should be able to see how these features change across seasons for different NBA teams and to compare teams. We therefore built an interactive dashboard as a Shiny app.
In this Shiny app, users can select a feature and one or more teams to visualize team performance on that feature, make comparisons, and get ranking predictions for the selected team(s) in the 2021-22 season.